
Conversation

@guptaNswati

Addresses #101 by adding device attributes to demonstrate how to update the ResourceClaim status.

Test run

    devices:
    - conditions:
      - lastTransitionTime: "2025-11-13T22:22:56Z"
        message: ""
        reason: GPUDeviceReady
        status: "True"
        type: Ready
      data:
        driverVersion:
          version: 1.0.0
        uuid:
          string: gpu-18db0e85-99e9-c746-8531-ffeb86328b39
      device: gpu-0

    devices:
    - conditions:
      - lastTransitionTime: "2025-11-13T22:22:56Z"
        message: ""
        reason: GPUDeviceReady
        status: "True"
        type: Ready
      data:
        driverVersion:
          version: 1.0.0
        uuid:
          string: gpu-93d37703-997c-c46f-a531-755e3e0dc2ac
      device: gpu-1

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 13, 2025
@k8s-ci-robot
Contributor

Welcome @guptaNswati!

It looks like this is your first PR to kubernetes-sigs/dra-example-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/dra-example-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: guptaNswati
Once this PR has been reviewed and has the lgtm label, please assign bart0sh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 13, 2025
@guptaNswati
Author

cc @nojnhuh for review.

@nojnhuh nojnhuh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Nov 14, 2025
return resultConfigs, nil
}

func (s *DeviceState) buildDeviceStatus(res *resourceapi.DeviceRequestAllocationResult) *resourceapply.AllocatedDeviceStatusApplyConfiguration {
Contributor

Does res need to be a pointer, or can we save the extra syntax and theoretical nil-dereference by making this a plain value? Not a blocking issue, but I may propose refactoring this when I integrate with #129.

Author

fixed.


func (s *DeviceState) buildDeviceStatus(res *resourceapi.DeviceRequestAllocationResult) *resourceapply.AllocatedDeviceStatusApplyConfiguration {
dn := res.Device
deviceInfo := make(map[string]interface{})
Contributor

nit: could we make this more strongly typed?

Suggested change
-deviceInfo := make(map[string]interface{})
+deviceInfo := make(map[string]resourceapi.DeviceAttribute)

Author

done.

WithDriver(res.Driver).
WithPool(res.Pool).
WithConditions(cond).
WithData(data)
Contributor

For my own understanding, who or what generally consumes this data?

Author

To me, this seems useful for monitoring and debugging when an admin wants to know the pod-to-GPU mapping, i.e. which GPU is being used by a particular pod. Right now this info is not readily available.
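
For reference, a condensed sketch of how the fragments above could fit together into the status.devices entry shown in the test run at the top. The JSON/RawExtension handling, the error-returning signature, and the assumed imports (encoding/json, k8s.io/apimachinery/pkg/runtime) are illustrative assumptions, not necessarily the exact code in this PR:

    // Sketch only: assumes the apply-configuration builders already used in this PR
    // (resourceapply.AllocatedDeviceStatus with WithDevice/WithDriver/WithPool/WithData)
    // and that the attribute map is JSON-encoded into the status "data" field.
    func (s *DeviceState) buildDeviceStatus(res resourceapi.DeviceRequestAllocationResult) (*resourceapply.AllocatedDeviceStatusApplyConfiguration, error) {
        // Collect the attributes to surface (uuid, model, driverVersion, ...).
        deviceInfo := make(map[string]resourceapi.DeviceAttribute)

        raw, err := json.Marshal(deviceInfo)
        if err != nil {
            return nil, err
        }

        return resourceapply.AllocatedDeviceStatus().
            WithDevice(res.Device).
            WithDriver(res.Driver).
            WithPool(res.Pool).
            WithData(runtime.RawExtension{Raw: raw}), nil
    }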

}
cond := metav1apply.Condition().
WithType("Ready").
WithStatus(metav1.ConditionTrue).
Contributor

Does the NVIDIA driver also only set this condition at the time the claim is allocated, or can it ever change after that? If it can change, I think the example driver should model how to update that. Even if it stays constant, scaffolding out how that can work with something that says "this is where the latest status is determined" would be helpful.

Author

So for the nvidia-dra-driver, I was looking to use this as part of the GPU health-check status update. But we didn't really see any value in showing this information as part of an allocated claim, since it does not really change. It's more complex to show any real updates on an allocated claim.

Contributor

Makes sense. I think for setting device conditions, the interesting part isn't the API call to update the condition, but actually calculating that value of the condition asynchronously. I think I'd rather include that (or no condition at all) than a hardcoded condition added during NodePrepareResources.

Author

Okay, since we don't have device health checking right now, I can change it to just "Allocated" for example purposes with a comment, or remove it.

// TODO: This condition currently reflects only that the device was allocated, not that
// it is healthy or ready for use. True readiness should be computed asynchronously by
// the health-monitoring pipeline (e.g., NVML Xid/error tracking). In the future, this
// should be replaced—or augmented—with a condition whose value is driven by actual
// device health state rather than allocation alone.
cond := metav1apply.Condition().
    WithType("Allocated").
    WithStatus(metav1.ConditionTrue).
    WithReason("GPUDeviceAllocated").
    WithMessage("GPU device allocated for this request").
    WithLastTransitionTime(metav1.Now())

what do you think?

Contributor

I would still prefer to either omit this condition entirely or create all of the scaffolding around driving the condition asynchronously. I don't think we should be implementing patterns that we don't recommend, even if we document the right way.

If we decide to omit the condition here, let's open an issue to track adding async conditions.

Author

Okay, sounds good. Then I will remove it for now, since it requires more work, and I will open an issue and work on it.
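
For the follow-up issue, a rough sketch of what asynchronously driving a device condition could look like. It reuses the apply-configuration helpers and client already referenced in this PR (resourceapply, metav1apply, consts.DriverName, s.config.coreclient); checkDeviceHealth is a hypothetical placeholder for a real health source such as NVML, and the function name and parameters are illustrative:

    // Hypothetical sketch: periodically recompute a device condition and re-apply it to the
    // claim status, instead of hardcoding it once during NodePrepareResources.
    func (s *DeviceState) runDeviceConditionLoop(ctx context.Context, ns, claimName, pool, deviceName string) {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                status, reason := metav1.ConditionTrue, "GPUDeviceHealthy"
                if !s.checkDeviceHealth(deviceName) { // hypothetical health source (e.g. NVML Xid tracking)
                    status, reason = metav1.ConditionFalse, "GPUDeviceUnhealthy"
                }
                cond := metav1apply.Condition().
                    WithType("Ready").
                    WithStatus(status).
                    WithReason(reason).
                    WithLastTransitionTime(metav1.Now())
                device := resourceapply.AllocatedDeviceStatus().
                    WithDevice(deviceName).
                    WithPool(pool).
                    WithDriver(consts.DriverName).
                    WithConditions(cond)
                claim := resourceapply.ResourceClaim(claimName, ns).
                    WithStatus(resourceapply.ResourceClaimStatus().WithDevices(device))
                opts := metav1.ApplyOptions{FieldManager: consts.DriverName, Force: true}
                if _, err := s.config.coreclient.ResourceV1().ResourceClaims(ns).ApplyStatus(ctx, claim, opts); err != nil {
                    klog.FromContext(ctx).Error(err, "failed to apply device condition", "claim", claimName, "device", deviceName)
                }
            }
        }
    }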


opts := metav1.ApplyOptions{FieldManager: consts.DriverName, Force: true}

_, err := s.config.coreclient.ResourceV1().ResourceClaims(ns).ApplyStatus(ctx, claim, opts)
Contributor

Is there some lightweight verification we can add to the e2e tests? At least verifying that the condition or one of the attributes is set on even one of the examples would make sure this doesn't totally break later.

Author

sure.
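
The repo's e2e harness is a shell script (see the run output further down, which checks status.devices[0].data); purely as a sketch, an equivalent programmatic check with client-go might look roughly like this. The function name is illustrative, and the field shapes (Status.Devices, Data as a *runtime.RawExtension) are assumed from resource.k8s.io/v1:

    // Hypothetical sketch: verify that at least one allocated device in the claim status
    // carries the data this PR adds. Assumes a kubernetes.Interface clientset plus the
    // usual context/fmt/metav1 imports.
    func verifyClaimDeviceData(ctx context.Context, clientset kubernetes.Interface, ns, name string) error {
        claim, err := clientset.ResourceV1().ResourceClaims(ns).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        if len(claim.Status.Devices) == 0 {
            return fmt.Errorf("claim %s/%s has no device status", ns, name)
        }
        dev := claim.Status.Devices[0]
        if dev.Data == nil || len(dev.Data.Raw) == 0 {
            return fmt.Errorf("device %s in claim %s/%s has no status data", dev.Device, ns, name)
        }
        return nil
    }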

Comment on lines 409 to 412
if d.Attributes != nil {
attributes := d.Attributes

if uuid, ok := attributes["uuid"]; ok {
Contributor

nit: indexing into a nil map doesn't panic, so we can skip this check. I at least don't see any issues skipping it if I remove all of the attributes from the devices and allocate them.

Suggested change
-if d.Attributes != nil {
-    attributes := d.Attributes
-    if uuid, ok := attributes["uuid"]; ok {
+if uuid, ok := d.Attributes["uuid"]; ok {

Author

done.
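
For reference, the language behavior the nit relies on: reading from a nil map in Go is safe and simply reports a miss, so the explicit nil check is redundant. A minimal, self-contained illustration:

    package main

    import "fmt"

    func main() {
        var attrs map[string]string // nil map: only writes panic, reads are safe
        uuid, ok := attrs["uuid"]
        fmt.Println(uuid, ok) // prints: " false"
    }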

attributes := d.Attributes

if uuid, ok := attributes["uuid"]; ok {
deviceInfo["uuid"] = uuid
Contributor

Taking another look at this, I'm not sure I see the value in propagating attributes that already exist in the ResourceSlice. Is there a common use case where deriving these attributes from the ResourceSlice doesn't work?

Or is there some other data we can use here? Maybe like the time-slicing/space-partitioning config? I could see that being useful to model the opaque config being like a request, then the device status would represent the response in cases where a given config might result in several different outcomes. Kind of like the regular spec/status model.

Author

Makes sense. I need to see what's more valuable to add here. My thought was to have this as an example.

Contributor

Actually I suppose attributes here could be useful in case the device is removed from a ResourceSlice because it became unhealthy or otherwise. Let's keep them here, but can we include a comment explaining when recording the attributes in the status is useful?
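
As one illustrative way the requested comment could read (wording is an assumption), attached to the attribute propagation from the suggestion above:

    // Copy select attributes (uuid, model, driverVersion) into the claim's device status.
    // This duplicates what the ResourceSlice advertises, but the ResourceSlice entry can
    // disappear (e.g. the device is removed after becoming unhealthy), while the allocated
    // claim and its status remain available for debugging the pod-to-GPU mapping.
    if uuid, ok := d.Attributes["uuid"]; ok {
        deviceInfo["uuid"] = uuid
    }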

@nojnhuh
Contributor

nojnhuh commented Dec 3, 2025

FYI @guptaNswati I was hoping to get this PR merged before #129 to keep you from having to deal with those changes here. There are other changes in the works that are based on that PR though, so I may elect to merge that one first. If #129 merges first, I'm happy to help resolve conflicts with your PR here!

@guptaNswati
Author

> FYI @guptaNswati I was hoping to get this PR merged before #129 to keep you from having to deal with those changes here. There are other changes in the works that are based on that PR though, so I may elect to merge that one first. If #129 merges first, I'm happy to help resolve conflicts with your PR here!

Oh I see, thank you for the heads-up. I will see if I can update this tomorrow; otherwise you can go ahead and merge #129. The main blocker is the e2e, right? We can't merge this without the tests?

@nojnhuh
Contributor

nojnhuh commented Dec 4, 2025

@guptaNswati Yes, I would like to include something in the tests for this. #128 (comment) is the other comment I would like to resolve before merging.

@nojnhuh
Contributor

nojnhuh commented Dec 5, 2025

@guptaNswati Have you pushed your recent changes? The most recent commit I see here is from when the PR was first opened.

@guptaNswati
Author

> @guptaNswati Have you pushed your recent changes? The most recent commit I see here is from when the PR was first opened.

Sorry, pushing them right now.

@guptaNswati
Author

guptaNswati commented Dec 5, 2025

This is a quick test with the updated changes. Fixing the e2e test next.

I1205 21:38:59.352921       1 state.go:229] Adding device attribute to claim gpu-test1/pod0-gpu-zfncn
  devices:
  - conditions:
    - lastTransitionTime: "2025-12-05T21:38:59Z"
      message: GPU Device Allocated for this request
      reason: GPUDeviceAllocated
      status: "True"
      type: Allocated
    data:
      driverVersion:
        version: 1.0.0
      model:
        string: LATEST-GPU-MODEL
      uuid:
        string: gpu-657bd2e7-f5c2-a7f2-fbaa-0d1cdc32f81b

@guptaNswati
Author

e2e test

 ./test/e2e/e2e.sh 
./test/e2e/e2e.sh: line 1: !/usr/bin/env: No such file or directory
dra-example-driver-cluster
kind-dra-1
NAME                                       STATUS   ROLES           AGE    VERSION
dra-example-driver-cluster-control-plane   Ready    control-plane   104s   v1.34.0
dra-example-driver-cluster-worker          Ready    <none>          89s    v1.34.0
node/dra-example-driver-cluster-worker condition met
Waiting for webhook to be available
resourceclaim.resource.k8s.io/webhook-test created (server dry run)
Webhook is available
namespace/gpu-test1 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
pod/pod0 created
pod/pod1 created
namespace/gpu-test2 created
resourceclaimtemplate.resource.k8s.io/multiple-gpus created
pod/pod0 created
namespace/gpu-test3 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
pod/pod0 created
namespace/gpu-test4 created
resourceclaim.resource.k8s.io/single-gpu created
pod/pod0 created
pod/pod1 created
namespace/gpu-test5 created
resourceclaimtemplate.resource.k8s.io/multiple-gpus created
pod/pod0 created
namespace/gpu-test6 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
pod/pod0 created
pod/pod0 condition met
pod/pod1 condition met
=== Verifying ResourceClaim device data in namespace gpu-test1 ===
Found ResourceClaim gpu-test1/pod0-gpu-pqv5n, checking status.devices[0].data ...
OK: ResourceClaim gpu-test1/pod0-gpu-pqv5n has device data (uuid=gpu-e7b42cb1-4fd8-91b2-bc77-352a0c1f5747, driverVersion=1.0.0)
Pod gpu-test1/pod0, container ctr0 claimed gpu-4
Pod gpu-test1/pod1, container ctr0 claimed gpu-5
pod/pod0 condition met
Pod gpu-test2/pod0, container ctr0 claimed gpu-0
Pod gpu-test2/pod0, container ctr0 claimed gpu-6
pod/pod0 condition met
Pod gpu-test3/pod0, container ctr0 claimed gpu-1
Pod gpu-test3/pod0, container ctr1 claimed gpu-1
pod/pod0 condition met
pod/pod1 condition met
Pod gpu-test4/pod0, container ctr0 claimed gpu-2
Pod gpu-test4/pod1, container ctr0 claimed gpu-2
pod/pod0 condition met
Pod gpu-test5/pod0, container ts-ctr0 claimed gpu-3
Pod gpu-test5/pod0, container ts-ctr1 claimed gpu-3
Pod gpu-test5/pod0, container sp-ctr0 claimed gpu-7
Pod gpu-test5/pod0, container sp-ctr1 claimed gpu-7
pod/pod0 condition met
Pod gpu-test6/pod0, container init0 claimed gpu-8
Pod gpu-test6/pod0, container ctr0 claimed gpu-8
namespace "gpu-test1" deleted
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
pod "pod1" deleted
namespace "gpu-test2" deleted
resourceclaimtemplate.resource.k8s.io "multiple-gpus" deleted
pod "pod0" deleted
namespace "gpu-test3" deleted
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
namespace "gpu-test4" deleted
resourceclaim.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
pod "pod1" deleted
namespace "gpu-test5" deleted
resourceclaimtemplate.resource.k8s.io "multiple-gpus" deleted
pod "pod0" deleted
namespace "gpu-test6" deleted
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
test ran successfully

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 5, 2025
Signed-off-by: Swati Gupta <[email protected]>

address review comment: fix pointer ref

Signed-off-by: Swati Gupta <[email protected]>
